Dealing with Failures During Failure Recovery of Distributed Systems ; CU-CS-1009-06

Authors

Naveed Arshad

Dennis Heimbigner

Alexander L. Wolf

Abstract

One of the characteristics of autonomic systems is self-recovery from failures. Self-recovery can be achieved by sensing failures, planning for recovery, and executing the recovery plan to bring the system back to a normal state. For various reasons, however, additional failures are possible during the process of recovering from the initial failure. Handling such secondary failures is important because they can cause the original recovery plan to fail and can leave the system in a complicated state that is worse than before. In this paper, techniques are identified to preserve consistency while dealing with failures that occur during failure recovery.

Similar resources

Checkpointing and Rollback Recovery in Distributed Systems: Existing Solutions, Open Issues and Proposed Solutions

Checkpointing and rollback recovery are well-established techniques for dealing with failures in distributed systems. In this paper, we briefly summarize the existing solution approaches to these problems and also discuss the open issues, suggested approaches and some preliminary work that we have done addressing the open issues.

Distributed-System Failures: Observations and Implications for Testing ; CU-CS-994-05

Distributed software systems are notoriously difficult to test. As in all software testing, the space of potential test cases for distributed systems is intractably large and so the efforts of testers must be directed to the scenarios that are most important. Unfortunately, there does not currently exist a general-purpose, disciplined, and effective testing method for distributed systems. In th...

MTBF evaluation for 2-out-of-3 redundant repairable systems with common cause and cascade failures considering fuzzy rates for failures and repair: a case study of a centrifugal water pumping system

In many cases, redundant systems are beset by both independent and dependent failures. Ignoring dependent variables in MTBF evaluation of redundant systems hastens the occurrence of failure, causing it to take place before the expected time, hence decreasing safety and creating irreversible damages. Common cause failure (CCF) and cascading failure are two varieties of dependent failures, both l...

PREFAIL: Programmable and Efficient Failure Testing Framework

With the arrival of the cloud computing era, large-scale distributed systems are increasingly in use. These systems are built out of tens of thousands of commodity machines that are not fully reliable and can fail from time to time [1, 2, 7, 10, 14, 15]. Thus, the software that runs on these systems has a great responsibility to correctly recover from frequent, diverse hardware failures. Even if...

A Proposal to investigate the use of error correcting code techniques in implementing distributed systems resistant to Byzantine failures and security breaches

Throughout the literature on reliable distributed systems there is much coverage of systems which maintain correct operations in the face of fail-stop or non-Byzantine failures. Less well represented are methods for dealing with the harder problem of Byzantine failures. This paper proposes a method for dealing with this sort of failure. Fail-stop or non-Byzantine failures typically are ch...

Publication date: 2015
